Self-heal pods wedged by transient gcsfuse-sidecar metadata-server timeouts #286
Open
morgan-wowk wants to merge 1 commit into
…meouts

GKE-injected gke-gcsfuse-sidecar pods can wedge in CreateContainerConfigError when the sidecar's bucket-access-check pre-flight times out resolving Workload Identity via a degraded metadata server (exit 255). The GCS volume never mounts, the main container never starts, and the orchestrator polls the pod until the run-level timeout cancels it with no logs, burning a whole run on a transient platform fault.

Detect that signature and self-heal by relaunching the task in place:

- launchers/interfaces.py: replace the no-op try_self_heal() hook with a detection predicate transient_infra_failure_reason() -> str | None (default None, so other launchers are unaffected).
- kubernetes_launchers.py: implement the predicate for the gcsfuse-sidecar wedge (sidecar exit 255 + transient reason + bucket-access-check message, guarded by the main container not having started). No pod-spec surgery.
- orchestrator_sql.py: when a running execution reports a transient infra failure, terminate the wedged pod, mark its ContainerExecution SYSTEM_ERROR (which records the failed attempt and excludes it from cache reuse), and re-queue the same ExecutionNode within the same run so the canonical launch path builds a fresh pod. Capped at 2 retries (tracked on the node's extra_data); beyond that the node fails and downstream is skipped so the run fails fast.

Same run, same node, same provenance: no new run, no pod surgery.

Tests cover the launcher detection signature and the orchestrator re-queue / retry-cap behaviour end to end against an in-memory DB.
279723d to
13dd31c
Compare

Problem
Pods with the GKE-injected gke-gcsfuse-sidecar container can wedge in CreateContainerConfigError when the sidecar's bucket-access-check pre-flight times out resolving Workload Identity via a degraded metadata server (sidecar exits 255). The GCS volume never mounts, the main container never starts, and the orchestrator polls the pod until the run-level timeout cancels it with no logs, burning a whole run (e.g. a daily eval) on a transient platform fault. There is no execution-level auto-retry today.

Approach: re-queue in place
When the wedge is detected on a running execution, the orchestrator:
- Terminates the wedged pod.
- Marks its ContainerExecution SYSTEM_ERROR. This records the failed attempt for forensics and removes it from cache-reuse candidates (reuse only considers PENDING/RUNNING/SUCCEEDED).
- Re-queues the same ExecutionNode within the same run, so the canonical launch_container_task path builds a fresh, correct pod (usually landing on a healthier node).

Capped at 2 retries, tracked on the node's extra_data. Beyond the cap the node is failed (SYSTEM_ERROR) and its downstream skipped, so the run fails fast instead of hanging until its timeout.

Same run, same node, same provenance: no surprise sibling run, no pod-spec surgery.
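
The re-queue and retry-cap flow described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in orchestrator_sql.py: the Node and Execution classes, the extra_data keys, and the function name are hypothetical stand-ins, and pod termination is left out.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    SYSTEM_ERROR = "SYSTEM_ERROR"
    SKIPPED = "SKIPPED"


MAX_INFRA_RETRIES = 2                   # cap stated in this PR
RETRY_KEY = "transient_infra_retries"   # hypothetical extra_data key
REASON_KEY = "transient_infra_reason"   # hypothetical extra_data key


@dataclass
class Execution:
    """Hypothetical stand-in for ContainerExecution."""
    status: Status = Status.RUNNING


@dataclass
class Node:
    """Hypothetical stand-in for ExecutionNode."""
    status: Status = Status.RUNNING
    execution: Optional[Execution] = None
    extra_data: dict = field(default_factory=dict)
    downstream: List["Node"] = field(default_factory=list)


def handle_transient_infra_failure(node: Node, reason: str) -> bool:
    """Return True if the node was re-queued, False if it was failed."""
    # The wedged attempt is always recorded as SYSTEM_ERROR.
    if node.execution is not None:
        node.execution.status = Status.SYSTEM_ERROR

    retries = node.extra_data.get(RETRY_KEY, 0)
    if retries >= MAX_INFRA_RETRIES:
        # Cap exceeded: fail the node and skip everything downstream of it.
        node.status = Status.SYSTEM_ERROR
        stack = list(node.downstream)
        while stack:
            child = stack.pop()
            child.status = Status.SKIPPED
            stack.extend(child.downstream)
        return False

    # Under the cap: record the retry and hand the node back to the queue.
    node.extra_data[RETRY_KEY] = retries + 1
    node.extra_data[REASON_KEY] = reason
    node.execution = None  # the launch path attaches a fresh execution later
    node.status = Status.QUEUED
    return True
```

Keeping the counter on the node rather than on the execution is what lets the cap survive across attempts: each wedged execution is discarded, while the node persists for the whole run.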
Why this design
Two earlier approaches were rejected:
- skipCSIBucketAccessCheck=true (oasis-backend #405): proven ineffective on driver v1.21.24-gke.5 in staging (wrong knob; it skips the CSI-node check, not the sidecar check that hits the metadata server).

Changes
- launchers/interfaces.py: replace the no-op try_self_heal() hook with a detection predicate transient_infra_failure_reason() -> str | None (default None, so other launchers are unaffected).
- launchers/kubernetes_launchers.py: implement the predicate for the gcsfuse-sidecar wedge (sidecar exit 255 + transient reason + "bucket access check" message, guarded by the main container not having started). Detection-only; no pod mutation.
- orchestrator_sql.py: dispatch to _handle_transient_infra_failure(...) from the running-execution path; terminate + SYSTEM_ERROR + re-queue with the 2-retry cap.

Model / persistence / UX
- ContainerExecution (FK); on re-queue the FK is repointed to the fresh execution at relaunch. The wedged execution persists as an unreferenced SYSTEM_ERROR row (forensics). No logs lost: a wedged pod never produced container logs.
- container_execution, so after relaunch the tile follows the fresh pod and ends green. The retry is auditable via the node's auto container_execution_status_history and the standalone SYSTEM_ERROR row.

Tests
- transient_infra_failure_reason() returns a reason for the wedge signature and None otherwise (running/terminated main, clean sidecar exit, unrelated message, no sidecar).
- SYSTEM_ERROR, the node returns to QUEUED with retry count 1 and the reason recorded; exceeding the cap flips the node to SYSTEM_ERROR and skips downstream.

Verified: pytest, black, import smoke all green.
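
For readers who want a concrete picture of the detection signature that the launcher tests exercise, here is a hedged sketch. It is not the code in kubernetes_launchers.py. It assumes container statuses arrive as plain dicts shaped like Kubernetes containerStatuses entries, that the "transient reason" is the main container's waiting reason, and that a free function taking main_name is acceptable for illustration; the TRANSIENT_WAITING_REASONS set is likewise an assumption.

```python
from typing import List, Optional

SIDECAR_NAME = "gke-gcsfuse-sidecar"
# Assumption: the "transient reason" is the main container's waiting reason.
TRANSIENT_WAITING_REASONS = {"CreateContainerConfigError"}
MESSAGE_MARKER = "bucket access check"


def transient_infra_failure_reason(
    statuses: List[dict], main_name: str
) -> Optional[str]:
    """Return a short reason string for the sidecar wedge, else None."""
    by_name = {s.get("name"): s for s in statuses}
    main = by_name.get(main_name)
    sidecar = by_name.get(SIDECAR_NAME)
    if main is None or sidecar is None:
        return None

    # Guard: if the main container ever started, this is not the wedge.
    main_state = main.get("state") or {}
    if "running" in main_state or "terminated" in main_state:
        return None
    waiting_reason = (main_state.get("waiting") or {}).get("reason")
    if waiting_reason not in TRANSIENT_WAITING_REASONS:
        return None

    # The sidecar may have restarted, so check both state and lastState.
    terminated = (sidecar.get("state") or {}).get("terminated") or (
        sidecar.get("lastState") or {}
    ).get("terminated")
    if not terminated or terminated.get("exitCode") != 255:
        return None
    if MESSAGE_MARKER not in (terminated.get("message") or "").lower():
        return None

    return f"gcsfuse sidecar exited 255 while main was {waiting_reason}"
```

Requiring all of the conditions at once keeps the predicate conservative: any single signal on its own (a non-zero sidecar exit, a waiting main container) also occurs in failures that a relaunch would not fix.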